Uniformize kwargs for LLaVa processor and update docs #32858
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for working on this, LGTM!
Left one nit with the docstring, and we'd need to swap the processor args order in test_modeling_llava.py (see the sketch just below).
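For concreteness, a sketch of that arg-order swap (checkpoint id and prompt are illustrative, not the exact test code):

```python
import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # illustrative checkpoint
raw_image = Image.fromarray(np.zeros((336, 336, 3), dtype=np.uint8))
prompt = "USER: <image>\nWhat is shown? ASSISTANT:"

# before this PR, LLaVa's processor took text first:
#   inputs = processor(prompt, raw_image, return_tensors="pt")
# after uniformization, images come first:
inputs = processor(raw_image, prompt, return_tensors="pt")
```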
Also:
```python
# check if images and text inputs are reversed for BC
if (
    text is not None
    and not isinstance(text[0], str)
```
`text` could also just be a str
The check should still work in that case, except if we have an empty string. I will try to think of something cleaner
Hey @zucchini-nlp @leloykun! Here is a revised version of the backward-compatibility handling for reversed `images` and `text` inputs. The main addition is the `_check_reversed_images_text` function in `processing_utils.py`, which should better detect whether inputs need to be reversed, and can be used by all processors where the `images` and `text` inputs have been swapped.
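Roughly, the idea is something like this sketch (`valid_images` comes from `transformers.image_utils`; the body here is illustrative, not the PR's final code):

```python
import warnings

from transformers.image_utils import valid_images


def _check_reversed_images_text(images, text):
    """BC shim: swap `images` and `text` if they look like they were passed in reverse order."""

    def _looks_like_text(t):
        # a string, or a (possibly nested) list/tuple of strings
        if isinstance(t, str):
            return True
        if isinstance(t, (list, tuple)) and len(t) > 0:
            return _looks_like_text(t[0])
        return False

    if images is not None and _looks_like_text(images) and (text is None or valid_images(text)):
        warnings.warn(
            "You may have used the wrong order for inputs. `images` should be passed before `text`. "
            "The `images` and `text` inputs will be swapped."
        )
        images, text = text, images
    return images, text
```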
src/transformers/processing_utils.py (outdated)

```diff
@@ -120,7 +118,6 @@ class TextKwargs(TypedDict, total=False):
     return_offsets_mapping: Optional[bool]
     return_length: Optional[bool]
     verbose: Optional[bool]
-    padding_side: Optional[str]
```
I removed `padding_side` in this PR, as this should be one of the first uniformization PRs to get merged. This shouldn't break anything (see #32544 (comment)).
I believe it will break things in the sense that people who were previously passing `padding_side` to the processor (even if it had no effect) would now experience an error, I think?
In the case of llava at least, even with `padding_side` in the base TextKwargs, if the user passes `padding_side` to a processor call, there will be an error, as the tokenizer inherits from `PreTrainedTokenizerFast` and its call function `_batch_encode_plus` takes neither `padding_side` nor `**kwargs`:
src/transformers/tokenization_utils_fast.py, lines 491 to 512 (at 27903de):

```python
def _batch_encode_plus(
    self,
    batch_text_or_text_pairs: Union[
        List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]
    ],
    add_special_tokens: bool = True,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[str] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    split_special_tokens: bool = False,
) -> BatchEncoding:
```
On that point, if we don't accept kwargs for this function, shouldn't we restrict the kwargs passed to the processor's modality processor classes? I guess this would require a bit of refactoring of the base processor class, but right now, if a user passes a kwarg not supported by any ModalityKwargs class, it is passed as a "common" kwarg to all of the modality processors. If the call functions of the modality processors accept kwargs, the kwarg will just be ignored, but in the case of this `_batch_encode_plus` function (there might be others) there will be an error.
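A minimal repro sketch of that failure mode (checkpoint id and prompt are illustrative):

```python
import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # illustrative checkpoint
image = Image.fromarray(np.zeros((336, 336, 3), dtype=np.uint8))

# At the time of this discussion, an unrecognized kwarg was forwarded as a "common"
# kwarg to all modality processors; the fast tokenizer's `_batch_encode_plus` accepts
# neither `padding_side` nor `**kwargs`, so this call errored with a TypeError.
processor(
    images=image,
    text="USER: <image>\nWhat is shown? ASSISTANT:",
    padding=True,
    padding_side="left",
)
```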
OK, I see. More generally, we might want to think about being able to control the padding side when calling the tokenizer, as this is something we'd like to be able to handle for e.g. llava, but that is a topic for a wider discussion and future PRs cc @zucchini-nlp
In this case, if `padding_side` can't be accepted by fast tokenizers, then yes, I think we should remove it here. I'd do this in a separate PR, as it might affect other processors which use TextKwargs, and so it would be good to introduce it as an atomic change we can easily test, isolate and revert if needed.
Seems like we're welcome to accept `padding_side` as a kwarg in tokenizers (#30447), but I couldn't find a PR for that. I totally agree that it's a nice feature to have, and it looks easy to implement. I can work on it next week so that we don't have to drop the kwarg in processors. The only thing is that it will be kind of BC-breaking, because earlier we ignored the kwarg and now we'll use it to pad on the correct side. WDYT? @amyeroberts
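I.e., the goal would be to support something like this sketch (per-call padding side, later added in #33385):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint

# pad on the left for this call only, without mutating tok.padding_side
out = tok(["hi", "a much longer sentence"], padding=True, padding_side="left")
print(out["input_ids"][0])  # pad token ids appear at the start of the shorter sequence
```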
@zucchini-nlp Sounds good to me! Even if it's breaking, I think it's breaking in the right way: correcting a surprising behaviour
Added `padding_side` back to this PR. Waiting on `padding_side` to be accepted as a kwarg in tokenizers before merging this PR, then. Could you ping this PR or me when you open a PR, @zucchini-nlp? Thanks!
Here it is: #33385
src/transformers/processing_utils.py (outdated)

```diff
@@ -830,6 +827,60 @@ class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwargs
             output_kwargs[modality].update(output_kwargs["common_kwargs"])
         return output_kwargs

+    def _check_reversed_images_text(self, images, text):
```
This function might be too problem-specific to be included in the base processor class, but since several VLMs are affected by this issue and it's only a temporary solution pending deprecation, I thought it might make sense to include it here. It seems more practical than having the function copied across multiple processors, and it should also simplify the deprecation process. Of course, if you think there's a better location for this or another approach that would be more suitable, I'm open to suggestions.
Hmm, to me it seems better placed in each processor file with `# Copied from` statements. I guess there are at most 5 or so models that need swapping, if I'm not wrong. For changes to the general processor file, @amyeroberts can say more.
I just spoke with Amy, and a good middle-ground solution would be to keep this function in `processing_utils`, but outside of any class, so that models that need it would explicitly import it from `processing_utils` (roughly the pattern sketched below). This approach avoids cluttering the base Processor class with a very problem-specific function and limits the diffs when adding and deprecating this behavior.
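A sketch of that pattern (the helper name `_validate_images_text_input_order` is from this PR's commit list; the exact signature may differ):

```python
# in src/transformers/models/llava/processing_llava.py (sketch)
from ...processing_utils import ProcessorMixin, _validate_images_text_input_order


class LlavaProcessor(ProcessorMixin):
    def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs):
        # module-level helper, explicitly imported rather than inherited from the base class
        images, text = _validate_images_text_input_order(images, text)
        ...
```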
src/transformers/processing_utils.py (outdated)

```python
"You may have used the wrong order for inputs. `images` should be passed before `text`. "
"The `images` and `text` inputs will be swapped."
```
Should we include info about deprecating this behavior in a future version here?
Yes please, 2-3 major versions from current should be enough
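E.g., something along these lines (sketch; the exact target version is illustrative):

```python
warnings.warn(
    "You may have used the wrong order for inputs. `images` should be passed before `text`. "
    "The `images` and `text` inputs will be swapped. This behavior will be removed in a "
    "future version, e.g. v4.47."
)
```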
src/transformers/processing_utils.py (outdated)

```python
elif isinstance(t[0], (list, tuple)):
    # ... list with an empty list or with a list of strings
    return len(t[0]) == 0 or isinstance(t[0][0], str)
else:
```
I don't think we work with nested lists of text in processors; that's usually for tokenizers with text pairs. Also, the empty list: is it a valid input type?
What can be a valid type is encoded text, which is missing here.
I took an existing `_is_valid_text_input` function that is defined in several tokenizers, but you are right, it should probably be adapted a bit for processors.
Some VLMs for object detection do use nested lists of text, such as Owlv2 or OmDet-Turbo, so it's probably better to keep this.
Encoded texts are lists of ints, right? I will add them. Although it seems we usually don't advertise `EncodedInput` as an acceptable input type for `text` in processors, so I don't know whether that is on purpose or not.
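For illustration, a sketch of the adapted check with encoded inputs included (the PR's final helper may differ):

```python
def _is_valid_text_input(t):
    """Accept a str, a (possibly nested) list of str, or encoded text (list of ints)."""
    if isinstance(t, str):
        return True
    if isinstance(t, (list, tuple)):
        if len(t) == 0:
            return True  # empty list treated as a (degenerate) valid text input
        if isinstance(t[0], (str, int)):
            return True  # list of strings, or encoded text (list of token ids)
        if isinstance(t[0], (list, tuple)):
            # nested lists, e.g. Owlv2 / OmDet-Turbo text queries,
            # or batched encoded inputs (list of lists of ints)
            return len(t[0]) == 0 or isinstance(t[0][0], (str, int))
    return False
```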
We can also reuse the `is_valid_image` util.
Yep, I'm using `valid_images`, which recursively uses `is_valid_image` to check for nested images (quick usage sketch below).
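A quick usage sketch of those utils from `transformers.image_utils`:

```python
import numpy as np
from transformers.image_utils import is_valid_image, valid_images

img = np.zeros((224, 224, 3), dtype=np.uint8)

print(is_valid_image(img))             # True: a single PIL/np/torch image
print(is_valid_image("not an image"))  # False
print(valid_images([img, img]))        # True: validates every element of a list
print(valid_images([[img], [img]]))    # True: nested lists are checked recursively
```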
Thanks for adding - looking good!
Main comment is to make sure there are tests for the images/text inversion logic.
src/transformers/processing_utils.py (outdated)

```diff
@@ -993,6 +990,59 @@ def apply_chat_template(
         )

+
+def _check_reversed_images_text_for_vlms(images, text):
```
This behaviour should be tested
@amyeroberts since this function is in processing_utils, where would the best place be to test it? Or should we create model-specific tests for all models using it?
If there isn't a module for testing utils in processing_utils then we should add one!
Looking good! Changes to llava all look OK to me. I have one question about the tests for valid text inputs.
Main comment is that we should split up the changes to llava and the addition of the verification of the input order for the processors.
tests/utils/test_processing_utils.py (outdated)
❤️
tests/utils/test_processing_utils.py (outdated)

```python
self.assertIsInstance(valid_images, torch.Tensor)
self.assertEqual(valid_text, text)
```
This is a bit funny - why check the value of one and the instance type of another?
I wanted to use assert functions from `unittest.TestCase`, and `assertEqual` doesn't work with tensors or numpy arrays. Also, I thought that since only image inputs can be tensors/np arrays, if they are indeed tensors/np arrays it means that the switch (or not) went OK. But I agree it's not very consistent with the rest, and it doesn't check whether the function modified the inputs during the switch.
I could use `self.assertTrue(torch.equal(...))` and `self.assertTrue(np.array_equal(...))` instead?
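For example, a sketch of such a test using value-based assertions (the names `ProcessingUtilTester` and `_validate_images_text_input_order` are from this PR's commit list; the test bodies are illustrative):

```python
import unittest

import numpy as np
import torch

from transformers.processing_utils import _validate_images_text_input_order


class ProcessingUtilTester(unittest.TestCase):
    def test_reversed_images_text_are_swapped(self):
        images = torch.rand(1, 3, 224, 224)
        text = "What is shown in this image?"

        # pass the inputs in the reversed (legacy) order; the util should swap them back
        valid_images, valid_text = _validate_images_text_input_order(images=text, text=images)

        # value-based checks that also work for tensors
        self.assertTrue(torch.equal(valid_images, images))
        self.assertEqual(valid_text, text)

    def test_numpy_images_are_preserved(self):
        images = np.random.rand(1, 224, 224, 3)
        text = "What is shown in this image?"

        valid_images, valid_text = _validate_images_text_input_order(images, text)

        self.assertTrue(np.array_equal(valid_images, images))
        self.assertEqual(valid_text, text)
```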
Force-pushed from 146a50a to 2a7016d, then from 2a7016d to 4dc7ada.
Now that #33385 has been merged, this should be ready for review!
Looks great - thanks for making our processors nice and uniform!
Catching up on reviews, just saw @amyeroberts already approved but I can add my opinion and approve as well 😁 thanks a lot for working on this. I'll get back to the other pending PRs to help the effort as well
Squashed commits:
* Uniformize kwargs for LlaVa and update docs
* Change order of processor inputs in docstring
* Improve BC support for reversed images and text inputs
* cleanup llava processor call docstring
* Add encoded inputs as valid text inputs in reverse input check, add deprecation version in warning
* Put function check reversed images text outside base processor class
* Refactor _validate_images_text_input_order
* Add ProcessingUtilTester
* fix processing and test_processing
What does this PR do?
Adds uniformized processor kwargs for LLaVa, following #31911.
Who can review?
@molbap @zucchini-nlp